{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<i>Copyright (c) Microsoft Corporation. All rights reserved.</i>\n",
    "\n",
    "<i>Licensed under the MIT License.</i>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# DKN : Deep Knowledge-Aware Network for News Recommendation\n",
    "DKN \\[1\\] is a deep learning model which incorporates information from knowledge graph for better news recommendation. Specifically, DKN uses TransX \\[2\\] method for knowledge graph representaion learning, then applies a CNN framework, named KCNN, to combine entity embedding with word embedding and generate a final embedding vector for a news article. CTR prediction is made via an attention-based neural scorer. \n",
    "\n",
    "## Properties of DKN:\n",
    "- DKN is a content-based deep model for CTR prediction rather than traditional ID-based collaborative filtering. \n",
    "- It makes use of knowledge entities and common sense in news content via joint learning from semantic-level and knnowledge-level representations of news articles.\n",
    "- DKN uses an attention module to dynamically calculate a user's aggregated historical representaition.\n",
    "\n",
    "## Data format:\n",
    "One simple example: <br>\n",
    "`0 CandidateNews:1050,5238,5165,6996,1922,7058,7263,0,0,0 entity:0,0,0,0,2562,2562,2562,0,0,0 clickedNews0:7680,3978,1451,409,4681,0,0,0,0,0 entity0:0,0,0,395,0,0,0,0,0,0 clickedNews1:3698,2301,1055,6481,6242,7203,0,0,0,0 entity1:0,0,1904,1904,1904,0,0,0,0,0 ...`\n",
    "<br>\n",
    "In general, each line in data file represents one instance, the format is like: <br> \n",
    "`[label] [CandidateNews:w1,w2,w3,...] [entity:e1,e2,e3,...] [clickedNews0:w1,w2,w3,...] [entity0:e1,e2,e3,...] ...` <br>\n",
    "It contains several parts seperated by space, i.e. label part, CandidateNews part, ClickedHistory part.  CandidateNews part describes the target news article we are going to score in this instance, it is represented by (aligned) title words and title entities. To take a quick example, a news title may be : `Trump to deliver State of the Union address next week` , then the title words value may be `CandidateNews:34,45,334,23,12,987,3456,111,456,432` and the title entitie value may be: `entity:45,0,0,0,0,0,0,0,0,0`. Only the first value of entity vector is non-zero due to the word `Trump`. <br>\n",
    "clickedNewsk and entityk describe the k-th news article the user ever clicked and the format is the same as candidate news. Words and entities are aligned in news title. We use a fixed length to describe an article, if the title is less than the fixed length, just pad it with zeros."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Global settings and imports"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [],
   "source": [
    "import sys\n",
    "sys.path.append(\"../../\")\n",
    "from reco_utils.recommender.deeprec.deeprec_utils import *\n",
    "from reco_utils.recommender.deeprec.models.dkn import *\n",
    "from reco_utils.recommender.deeprec.IO.dkn_iterator import *\n",
    "import papermill as pm\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Download and load data"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_path = '../../tests/resources/deeprec/dkn'\n",
    "yaml_file = os.path.join(data_path, r'dkn.yaml')\n",
    "train_file = os.path.join(data_path, r'final_train_with_entity.txt')\n",
    "valid_file = os.path.join(data_path, r'final_test_with_entity.txt')\n",
    "wordEmb_file = os.path.join(data_path, r'word_embeddings_100.npy')\n",
    "entityEmb_file = os.path.join(data_path, r'TransE_entity2vec_100.npy')\n",
    "if not os.path.exists(yaml_file):\n",
    "    download_deeprec_resources(r'https://recodatasets.blob.core.windows.net/deeprec/', data_path, 'dknresources.zip')\n",
    "    \n",
    "    "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Create hyper-parameters"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "tags": [
     "parameters"
    ]
   },
   "outputs": [],
   "source": [
    "epoch=10"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[('DNN_FIELD_NUM', None), ('FEATURE_COUNT', None), ('FIELD_COUNT', None), ('MODEL_DIR', None), ('PAIR_NUM', None), ('SUMMARIES_DIR', None), ('activation', ['sigmoid']), ('attention_activation', 'relu'), ('attention_dropout', 0.0), ('attention_layer_sizes', 100), ('batch_size', 100), ('cross_activation', 'identity'), ('cross_l1', 0.0), ('cross_l2', 0.0), ('cross_layer_sizes', None), ('cross_layers', None), ('data_format', 'dkn'), ('dim', 100), ('doc_size', 10), ('dropout', [0.0]), ('dtype', 32), ('embed_l1', 0.0), ('embed_l2', 0.0005), ('enable_BN', False), ('entityEmb_file', '../../tests/resources/deeprec/dkn\\\\TransE_entity2vec_100.npy'), ('entity_dim', 100), ('entity_embedding_method', 'TransE'), ('entity_size', 3777), ('epochs', 10), ('fast_CIN_d', 0), ('filter_sizes', [1, 2, 3]), ('init_method', 'uniform'), ('init_value', 0.1), ('is_clip_norm', 0), ('iterator_type', None), ('kg_file', None), ('kg_training_interval', 5), ('layer_l1', 0.0), ('layer_l2', 0.0005), ('layer_sizes', [300]), ('learning_rate', 2e-05), ('load_model_name', None), ('load_saved_model', False), ('loss', 'log_loss'), ('lr_kg', 0.5), ('lr_rs', 1), ('max_grad_norm', 2), ('method', 'classification'), ('metrics', ['auc', 'acc', 'f1', 'logloss']), ('model_type', 'dkn'), ('mu', None), ('n_item', None), ('n_item_attr', None), ('n_user', None), ('n_user_attr', None), ('num_filters', 100), ('optimizer', 'adam'), ('reg_kg', 0.0), ('save_epoch', 2), ('save_model', False), ('show_step', 100000), ('train_ratio', None), ('transform', True), ('use_CIN_part', False), ('use_DNN_part', False), ('use_FM_part', False), ('use_Linear_part', False), ('user_clicks', None), ('user_dropout', False), ('wordEmb_file', '../../tests/resources/deeprec/dkn\\\\word_embeddings_100.npy'), ('word_size', 8033), ('write_tfevents', False)]\n"
     ]
    }
   ],
   "source": [
    "hparams = prepare_hparams(yaml_file, wordEmb_file=wordEmb_file, entityEmb_file=entityEmb_file, epochs=epoch)\n",
    "print(hparams)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "input_creator = DKNTextIterator"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train the DKN model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 15,
   "metadata": {},
   "outputs": [],
   "source": [
    "model = DKN(hparams, input_creator)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'auc': 0.5029, 'acc': 0.4275, 'f1': 0.0, 'logloss': 0.7688}\n"
     ]
    },
    {
     "name": "stderr",
     "output_type": "stream",
     "text": [
      "C:\\ProgramData\\Anaconda3\\envs\\reco\\lib\\site-packages\\sklearn\\metrics\\classification.py:1135: UndefinedMetricWarning: F-score is ill-defined and being set to 0.0 due to no predicted samples.\n",
      "  'precision', 'predicted', average, warn_for)\n"
     ]
    }
   ],
   "source": [
    "print(model.run_eval(valid_file))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "at epoch 1 train info: acc:0.6061, auc:0.7798, f1:0.7548, logloss:0.6705 eval info: acc:0.5725, auc:0.5484, f1:0.7281, logloss:0.6847\n",
      "at epoch 1 , train time: 98.8 eval time: 37.0\n",
      "at epoch 2 train info: acc:0.6061, auc:0.7643, f1:0.7548, logloss:0.6705 eval info: acc:0.5725, auc:0.5607, f1:0.7281, logloss:0.6846\n",
      "at epoch 2 , train time: 82.1 eval time: 36.7\n",
      "at epoch 3 train info: acc:0.6061, auc:0.7453, f1:0.7548, logloss:0.6704 eval info: acc:0.5725, auc:0.5651, f1:0.7281, logloss:0.6846\n",
      "at epoch 3 , train time: 81.9 eval time: 36.4\n",
      "at epoch 4 train info: acc:0.6061, auc:0.7339, f1:0.7548, logloss:0.6704 eval info: acc:0.5725, auc:0.5733, f1:0.7281, logloss:0.6846\n",
      "at epoch 4 , train time: 82.4 eval time: 35.1\n",
      "at epoch 5 train info: acc:0.6061, auc:0.7287, f1:0.7548, logloss:0.6704 eval info: acc:0.5725, auc:0.5839, f1:0.7281, logloss:0.6846\n",
      "at epoch 5 , train time: 81.4 eval time: 34.9\n",
      "at epoch 6 train info: acc:0.6061, auc:0.7278, f1:0.7548, logloss:0.6704 eval info: acc:0.5725, auc:0.5885, f1:0.7281, logloss:0.6846\n",
      "at epoch 6 , train time: 81.9 eval time: 34.6\n",
      "at epoch 7 train info: acc:0.6061, auc:0.729, f1:0.7548, logloss:0.6703 eval info: acc:0.5725, auc:0.5879, f1:0.7281, logloss:0.6846\n",
      "at epoch 7 , train time: 83.7 eval time: 34.7\n",
      "at epoch 8 train info: acc:0.6061, auc:0.7314, f1:0.7548, logloss:0.6703 eval info: acc:0.5725, auc:0.5888, f1:0.7281, logloss:0.6846\n",
      "at epoch 8 , train time: 82.2 eval time: 34.6\n",
      "at epoch 9 train info: acc:0.6061, auc:0.734, f1:0.7548, logloss:0.6702 eval info: acc:0.5725, auc:0.5889, f1:0.7281, logloss:0.6846\n",
      "at epoch 9 , train time: 82.1 eval time: 34.5\n",
      "at epoch 10 train info: acc:0.6061, auc:0.7376, f1:0.7548, logloss:0.67 eval info: acc:0.5725, auc:0.5899, f1:0.7281, logloss:0.6845\n",
      "at epoch 10 , train time: 83.3 eval time: 36.2\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "<reco_utils.recommender.deeprec.models.dkn.DKN at 0x10c02f9aac8>"
      ]
     },
     "execution_count": 16,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "model.fit(train_file, valid_file)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now we can test again the performance on valid set:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "{'auc': 0.5899, 'acc': 0.5725, 'f1': 0.7281, 'logloss': 0.6845}\n"
     ]
    }
   ],
   "source": [
    "res = model.run_eval(valid_file)\n",
    "print(res)\n",
    "pm.record(\"res\", res)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Reference\n",
    "\\[1\\] Wang, Hongwei, et al. \"DKN: Deep Knowledge-Aware Network for News Recommendation.\" Proceedings of the 2018 World Wide Web Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 2018.<br>\n",
    "\\[2\\] Knowledge Graph Embeddings including TransE, TransH, TransR and PTransE. https://github.com/thunlp/KB2E"
   ]
  }
 ],
 "metadata": {
  "celltoolbar": "Tags",
  "kernelspec": {
   "display_name": "Python (reco_bare)",
   "language": "python",
   "name": "reco_bare"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
